reg = require("d3-regression")
polyData = transpose(nonlinearData)
viewof degree = Inputs.range([1, 9], {step: 1, label: "Model Order", value: 1})
polyRegression = reg.regressionPoly()
  .x((d) => d.x)
  .y((d) => d.y)
  .order(degree)
  .domain(d3.extent(polyData, (d) => d.x));
polyCurveRaw = polyRegression(polyData)
polyCurve = polyCurveRaw.map(([x, y]) => ({ x, y }))
polyPredict = (x) => {
  // Prefer the generator's predict() if it is exposed; otherwise evaluate
  // the returned polynomial coefficients directly.
  if (typeof polyRegression.predict === "function") return polyRegression.predict(x);
  if (polyCurveRaw.coefficients) {
    return polyCurveRaw.coefficients.reduce((acc, coeff, i) => acc + coeff * x ** i, 0);
  }
  return NaN;
}
polyResiduals = polyData.map((d) => {
  const fit = polyPredict(d.x);
  return { x: d.x, y: d.y, fit, resid: d.y - fit };
})
polyMaxAbsResid = d3.max(polyResiduals, (d) => Math.abs(d.resid)) || 1
5 Model Building and Selection
In this module, we explore strategies for building effective regression models. We cover:
- Increasing model complexity to capture non-linear relationships and interactions,
- Assessing model fit while balancing complexity,
- Techniques for selecting parsimonious models.
5.1 Increasing model complexity to model complex relationships
Why increase model complexity?
We began this course with the simple linear regression model: a single continuous predictor with a straight-line effect.
Key-point: The ‘first-order’ single predictor model
This is another name for the simple linear model from ?sec-slm: \[
E[Y] = \beta_0 + \beta_1 X.
\]
This is termed a ‘first-order model’ because \(X\) appears only once, ‘as is’ (you’ll see shortly why this terminology is used); the relevant point is that this is the simplest linear model we can build with \(X\).
This simple model is often a reasonable starting point, and can go a long way to modeling real-world phenomena (especially in the multiple regression case), but real data may also show structure a straight line cannot capture, such as curvature.
Example 1: Non-linear relationships in data
For example, fitting a straight line to the following data is not ideal:
Square, Cubic, and Higher-Order Univariate Models
We enrich the basic linear model by adding powers of a predictor, allowing the fitted relationship to bend. In the above case, the “U” shape suggests a quadratic (squared) term may be appropriate.
Second-order (Quadratic) univariate model
\[ E[Y] = \beta_0 + \beta_1X + \beta_2X^2 \]
Example: Seeing how \(\beta_2\) bends the curve
Adjust the slider for the squared term to see how changing \(\beta_2\) adds curvature to the fitted relationship (with optional tweaks to \(\beta_0\) and \(\beta_1\)). Try to find a good fit to the data (can you guess what model generated this data?).
Third-order (cubic) univariate model
\[ E[Y] = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 \]
This allows more complex curvature, with up to two turning points in the fitted curve.
N-th order univariate model
We can extend this process of deriving new terms by adding further powers of \(X\), allowing us to fit arbitrarily complex curves: \[ E[Y] = \beta_0 + \beta_1X + \dots + \beta_nX^n. \]
Note that each term gets its own parameter (\(\beta_i\)). If we have \(n\) data points (with distinct \(X\) values), an \((n-1)\)-th-order model has a parameter for each point, meaning it will fit the data perfectly!
Example 2: Fitting higher-order polynomial models
Higher-order terms increase flexibility, but also increase the risk of fitting random noise rather than meaningful structure. We will return to this important issue later in the module.
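This overfitting risk is easy to demonstrate numerically. The sketch below (Python with numpy rather than R, purely as an illustration; the data are simulated) fits polynomials of increasing order to \(n = 6\) points: the residual sum of squares never increases with the order, and reaches zero at order \(n - 1\), where the curve interpolates every point.

```python
# Illustration: residual error shrinks monotonically with polynomial order
# and hits zero at order n - 1 (a parameter for every point).
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - x**2 + rng.normal(scale=0.3, size=n)  # quadratic truth + noise

def rss(order):
    """Residual sum of squares of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, deg=order)
    fitted = np.polyval(coeffs, x)
    return float(np.sum((y - fitted) ** 2))

errors = [rss(k) for k in range(n)]  # orders 0 .. n-1
print(errors)
```

The order-2 fit already captures the structure; the further drop in error at orders 3-5 is the model chasing noise rather than signal.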
Note: The meaning of “linear” in linear models
Even though higher-order models can model nonlinear relationships between the outcome and predictors, they are still linear models because each parameter enters additively. “Linearity” here refers to the parameters, not the shape of the fitted curve.
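This can be checked directly: fitting the quadratic model is ordinary least squares on a design matrix whose columns are \(1\), \(X\), and \(X^2\). A numpy sketch (simulated data, for illustration):

```python
# A "polynomial" model is linear in its parameters: solving it is plain
# linear least squares on the design matrix [1, X, X^2].
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=40)
y = 2.0 - 1.5 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=40)

X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # (b0, b1, b2)

# np.polyfit solves the identical linear problem (highest power first).
beta_polyfit = np.polyfit(x, y, deg=2)[::-1]
print(beta, beta_polyfit)
```

Both routes give the same coefficients, because the estimation problem is linear even though the fitted curve is not.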
Fitting higher-order univariate models in R
In R, higher-order polynomial terms can be included using the I() function to indicate ‘as is’ operations. For example, to fit a quadratic model:
lm(y ~ x + I(x^2), data = nonlinearData)
Call:
lm(formula = y ~ x + I(x^2), data = nonlinearData)
Coefficients:
(Intercept) x I(x^2)
2.6185 -3.3760 0.8218
or using the poly() function (which by default fits orthogonal polynomials, so the coefficients differ from the raw-power fit above even though the fitted values are identical):
lm(y ~ poly(x, 2), data = nonlinearData)
Call:
lm(formula = y ~ poly(x, 2), data = nonlinearData)
Coefficients:
(Intercept) poly(x, 2)1 poly(x, 2)2
0.3493 0.3941 3.0208
Interaction Models with Continuous Predictors
In ?sec-slr we saw a multiple regression model with two continuous predictors:
\[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2. \]
This again is a first-order model - each predictor appears only once, ‘as is’. Of course, we can add higher-order terms for each predictor separately (e.g., \(X_1^2\), \(X_2^3\)) to capture curvature in their individual effects as we did above.
Now, however, we can also consider a different form of higher-order term: what happens when we multiply predictors together? This gives us an interaction term - a term that combines two (or more) predictors.
For two continuous predictors, the second-order interaction model is:
Key-point: Second-order interaction model
\[
E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2.
\]
If \(\beta_3 = 0\) (i.e. as in the first-order model), predictors act independently of one another. If \(\beta_3 \neq 0\), however, the effect on \(E[Y]\) of changes in \(X_1\) varies with \(X_2\), and vice versa.
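A quick numeric check (Python, with hypothetical coefficient values) shows why: differencing \(E[Y]\) between \(X_1\) and \(X_1 + 1\) gives the \(X_1\) slope \(\beta_1 + \beta_3 X_2\), which moves with \(X_2\).

```python
# In E[Y] = b0 + b1*X1 + b2*X2 + b3*X1*X2, the slope in X1 is b1 + b3*X2.
b0, b1, b2, b3 = 2.0, 1.0, 0.5, 0.8  # hypothetical coefficients

def mean_y(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_in_x1(x2):
    """Change in E[Y] for a one-unit increase in X1, holding X2 fixed."""
    return mean_y(1.0, x2) - mean_y(0.0, x2)

print(slope_in_x1(0.0), slope_in_x1(2.0))  # slope grows with X2 when b3 > 0
```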
Example 3: Fitting an interaction model to the advertising dataset.
If we recall our plot of the advertising dataset from ?sec-mlr, our first-order linear model showed systematic lack of fit:
ads <- read.csv("Data/Advertising.csv")
fit <- lm(Sales ~ TV + Radio, data = ads)
The plane does not fit the data at the edges - it overestimates sales when either Radio or TV advertising is low (separately), but underestimates sales when both are high. This suggests that the effect of increasing one type of advertising depends on the level of the other type - an interaction effect.
We can fit a second-order interaction model to capture this: \[ E[Sales] = \beta_0 + \beta_1TV + \beta_2Radio + \beta_3(TV \times Radio). \]
fit_int <- lm(Sales ~ TV * Radio, data = ads)
The interaction effect can also be visualised by fixing one predictor and plotting the relationship between the other predictor and the outcome. In this case, we can plot lines representing the relationship between TV advertising and Sales when Radio advertising is low (e.g. 0 units), average, and high:
When our first order model has more than two predictors, we can include interaction terms between any pair (or more) of predictors. For example, with three predictors \(X_1\), \(X_2\), and \(X_3\), a second-order interaction model would include interactions between each pair: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3. \]
Higher-order interaction models
We can also consider higher-order interaction models that include products of three or more predictors. For example, a third-order interaction model with three predictors would include the three-way interaction term: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3 + \beta_7X_1X_2X_3. \] This term captures how the interaction between two predictors changes depending on the level of the third predictor.
As in the case of higher-order univariate models, we can extend this idea to include both higher-order interaction terms and higher-order univariate terms for each predictor, allowing for very flexible modeling of complex relationships.
However, as before, increasing model complexity in this way raises the risk of overfitting and interpretability challenges, which we will discuss later in this module.
Fitting interaction models in R
In R, interaction terms can be included using the * operator in the formula. For example, to fit a second-order interaction model with two predictors:
lm(Y ~ X1*X2, data = mydata)
This expands to include both main effects and the interaction term (it is equivalent to Y ~ X1 + X2 + X1:X2). To include an interaction term without automatically adding the main effects, use the : operator. For example, if we only wanted the interaction between X1 and X2, along with the linear terms for X1, X2, and X3:
lm(Y ~ X1+X2+X3+X1:X2, data = mydata)
Interaction Models with Categorical Predictors
A categorical variable with \(k\) levels is represented by \(k-1\) dummy variables: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \dots + \gamma_{k-1}D_{k-1}. \]
- Each \(\gamma_j\) measures the difference between level \(j\) and the baseline.
- Changing the baseline changes interpretations but not fitted values.
Categorical predictors may also interact with continuous predictors: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \delta_1(XD_1), \] allowing each group to have its own slope.
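The dummy-plus-interaction model is easy to see at work numerically. The sketch below (Python/numpy, simulated two-group data; illustrative only, not a course dataset) fits \(E[Y] = \beta_0 + \beta_1 X + \gamma_1 D_1 + \delta_1 (X D_1)\) and recovers a different slope for each group.

```python
# Dummy variable D (0 = baseline group, 1 = group B) plus an X*D interaction:
# baseline slope is b1, group B slope is b1 + d1.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
d = (np.arange(n) % 2).astype(float)  # alternate observations between groups
y = 1.0 + 0.5 * x + 2.0 * d + 1.5 * x * d + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x, d, x * d])
b0, b1, g1, d1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"baseline slope ~ {b1:.2f}, group B slope ~ {b1 + d1:.2f}")
```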
Note
Earlier tools such as t-tests and one-way ANOVA are special cases of linear models with categorical predictors. Regression provides a unified framework that handles categorical, continuous, and interaction effects together.
5.3 Assessing model fit and model complexity
Why not increase model complexity?
Adding terms always reduces residual error numerically, but may:
- Introduce unnecessary noise,
- Create unstable estimates,
- Reduce interpretability,
- Increase sensitivity to outliers,
- Encourage overfitting.
A good model is no more complex than necessary to describe the main structure of the data.
Multicollinearity
Multicollinearity occurs when predictors are highly correlated, leading to:
- Inflated standard errors,
- Unstable coefficient estimates,
- High sensitivity to small data changes.
Diagnostics include:
- Correlation matrices
- Variance Inflation Factors (VIF)
- Condition numbers
Example 4
If weight and engine displacement in a dataset are nearly perfectly correlated, including both may cause unstable coefficient signs and large standard errors.
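Echoing this example, here is a minimal numpy sketch (simulated data) of computing VIFs by hand: regress each predictor on the remaining ones and set \(\mathrm{VIF}_j = 1/(1 - R_j^2)\). Values well above roughly 5-10 flag problematic collinearity.

```python
# VIF by hand: regress predictor j on the others; VIF_j = 1 / (1 - R_j^2).
import numpy as np

rng = np.random.default_rng(3)
n = 200
weight = rng.normal(size=n)
displacement = weight + rng.normal(scale=0.1, size=n)  # nearly collinear
horsepower = rng.normal(size=n)                        # unrelated predictor
X = np.column_stack([weight, displacement, horsepower])

def vif(X, j):
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    r2 = 1.0 - np.var(X[:, j] - fitted) / np.var(X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)  # first two are large; the independent predictor is near 1
```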
Fit Metrics
\(R^2\)
Proportion of variation explained by the model. It never decreases - and typically increases - when predictors are added.
Adjusted \(R^2\)
Penalises extra predictors. Increases only when a new term improves explanatory power beyond chance.
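Concretely, with \(n\) observations and \(p\) predictors, the standard adjustment is \[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}, \] so an added predictor raises \(R^2_{\text{adj}}\) only when it reduces the estimated residual variance, not merely the residual sum of squares.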
Information Criteria: AIC, BIC
AIC balances fit and complexity: \[ \text{AIC} = -2\ell + 2k, \] where \(\ell\) is the maximised log-likelihood and \(k\) is the number of estimated parameters. BIC penalises complexity more heavily: \[ \text{BIC} = -2\ell + k\log(n). \] Lower values indicate a better fit–complexity trade-off.
Note
AIC is often preferred for prediction-oriented modelling; BIC tends to select simpler models.
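As a concrete sketch (Python/numpy, simulated data whose true model is first-order): for Gaussian errors the maximised log-likelihood satisfies \(-2\ell = n\log(\mathrm{RSS}/n)\) up to an additive constant, so both criteria can be computed straight from the residual sum of squares.

```python
# AIC/BIC for least-squares polynomial fits: up to a constant,
# -2*loglik = n*log(RSS/n); penalty 2k (AIC) or k*log(n) (BIC).
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # first-order truth

def ic(order):
    X = np.column_stack([x**p for p in range(order + 1)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = float(np.sum((y - X @ beta) ** 2))
    k = order + 1  # number of estimated coefficients
    return n * np.log(rss / n) + 2 * k, n * np.log(rss / n) + k * np.log(n)

scores = {order: ic(order) for order in range(6)}
best_aic = min(scores, key=lambda o: scores[o][0])
best_bic = min(scores, key=lambda o: scores[o][1])
print(best_aic, best_bic)
```

Because \(k\log n > 2k\) once \(n > e^2 \approx 7.4\), BIC penalises each extra term harder than AIC here, which is why it tends toward simpler orders.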
5.4 Methods for building parsimonious models
Stepwise Regression
Stepwise methods provide automated ways to search for simpler models.
Forward Selection
Begin with a minimal model (often the intercept). Add predictors one at a time when they improve AIC or adjusted \(R^2\).
Backward Selection
Begin with a saturated model containing all candidate predictors. Remove predictors that do not meaningfully contribute.
Note
Stepwise procedures should be treated as screening tools, not definitive modelling strategies. Final model decisions should consider diagnostics, interpretability, and subject-matter knowledge.
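The forward search can be sketched in a few lines (Python/numpy, simulated data with two real and two pure-noise predictors; in R, the step() function automates the same AIC-guided search):

```python
# Greedy forward selection by AIC: add the predictor that lowers AIC most;
# stop when no candidate improves it.
import numpy as np

rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)  # x2, x3 are noise

def aic(cols):
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

selected, remaining = [], [0, 1, 2, 3]
while remaining:
    best = min(remaining, key=lambda j: aic(selected + [j]))
    if aic(selected + [best]) >= aic(selected):
        break  # no remaining predictor improves AIC
    selected.append(best)
    remaining.remove(best)

print(selected)  # the two real predictors enter first
```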
5.5 Exercises
Exercise 1
A scatterplot of \(Y\) vs. \(X\) shows curvature.
- Fit a straight-line model and inspect residuals.
- Fit a model including \(X^2\).
- Compare AIC and adjusted \(R^2\) between the models.
- Discuss whether the added complexity is justified.